
1 RL-Based Routing in Biomedical Mobile Wireless Sensor Networks Using Trust and Reputation. Miss Yanee Naputta. A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Telecommunication Engineering, Suranaree University of Technology, Academic Year 2012.

2 RL-BASED ROUTING IN BIOMEDICAL MOBILE WIRELESS SENSOR NETWORKS USING TRUST AND REPUTATION Yanee Naputta A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Telecommunication Engineering Suranaree University of Technology Academic Year 2012

3 RL-BASED ROUTING IN BIOMEDICAL MOBILE WIRELESS SENSOR NETWORKS USING TRUST AND REPUTATION Suranaree University of Technology has approved this thesis submitted in partial fulfillment of the requirements for a Master's Degree. Thesis Examining Committee (Asst. Prof. Dr. Peerapong Uthansakul) Chairperson (Asst. Prof. Dr. Wipawee Hattagam) Member (Thesis Advisor) (Asst. Prof. Dr. Paramate Horkaew) Member (Prof. Dr. Sukit Limpijumnong) Vice Rector for Academic Affairs (Assoc. Prof. Flt. Lt. Dr. Kontorn Chamniprasart) Dean of Institute of Engineering

4 YANEE NAPUTTA : ROUTING IN BIOMEDICAL MOBILE WIRELESS SENSOR NETWORKS WITH REINFORCEMENT LEARNING USING TRUST AND REPUTATION (RL-BASED ROUTING IN BIOMEDICAL MOBILE WIRELESS SENSOR NETWORKS USING TRUST AND REPUTATION). THESIS ADVISOR : ASST. PROF. WIPAWEE HATTAGAM, Ph.D., 68 PP.

Biomedical sensor networks have become a promising approach for monitoring people's health both at home and in hospital. Their application is especially suitable for elderly and disabled people who prefer to move around rather than being confined to a particular place. Such networks allow continuous monitoring of a patient's physiological information: sensors are attached to the patient's body and the data are relayed back to the medical center. To support such biomedical applications, network performance parameters such as the packet delivery success ratio and the end-to-end delay must meet the required targets to ensure that data packets can be delivered to the medical center. However, in more realistic scenarios, some nodes do not cooperate with other nodes, for example by refusing to forward the packets they receive, whether because of battery depletion, node damage, or unexplained misbehavior, which degrades network performance.

The objective of this research is therefore to propose an improved routing method for biomedical mobile wireless sensor networks by integrating a reinforcement learning (RL) algorithm with a trust and reputation mechanism, called QRT, and to compare it with an existing method called the reinforcement learning based routing protocol (RL-QRP) algorithm and with a non-learning method called the threshold algorithm. Simulations were carried out under conditions of node mobility, node non-cooperation, and required end-to-end packet delay. This research studied three routing performance metrics: the average success ratio, the average end-to-end delay, and the number of discovered paths for each path length. The results show that the proposed QRT algorithm outperforms the existing RL-QRP algorithm and the threshold algorithm in terms of the average success ratio.

5 Under non-cooperative node conditions, the improvement was up to 11% and 25%, respectively, and under node mobility conditions up to 9% and 22%, respectively. Moreover, in the case of a required end-to-end packet delay, the QRT algorithm achieved an average success ratio up to 11% higher than the RL-QRP algorithm. The experimental results indicate that the trust and reputation approach can be applied to improve routing in mobile wireless sensor networks containing non-cooperative nodes, making them more effective under applications with constrained end-to-end delay.

School of Telecommunication Engineering    Student's Signature
Academic Year 2012    Advisor's Signature

6 YANEE NAPUTTA : RL-BASED ROUTING IN BIOMEDICAL MOBILE WIRELESS SENSOR NETWORKS USING TRUST AND REPUTATION. THESIS ADVISOR : ASST. PROF. WIPAWEE HATTAGAM, Ph.D., 68 PP.

MOBILE WIRELESS SENSOR NETWORKS/ REINFORCEMENT LEARNING/ TRUST AND REPUTATION/ ROUTING/ NON-COOPERATIVE

Biomedical sensor networks have become a potential solution for monitoring the health of people at home and in hospital. Their application is especially suitable for elderly and disabled people who may prefer to be on the move rather than confined to a particular area. Such networks allow continuous monitoring of the patient's physiological information: sensors are attached to the body and their readings are relayed back to the medical center. To support such applications, network performance metrics such as the packet delivery ratio and the end-to-end delay must be satisfied to ensure that data packets can be routed and reliably delivered to the medical center. However, in a more realistic scenario, some nodes do not cooperate with each other (i.e., they drop packets they receive), whether due to node battery depletion, malfunctioning, or simply misbehaving for unknown reasons, thereby degrading network performance.

The underlying aim of this research is therefore to propose an enhancement to RL-based routing in biomedical mobile wireless sensor networks by integrating it with trust and reputation, called QRT, and to compare it with an existing scheme that finds optimal paths through experience and rewards for biomedical sensor networks, called the reinforcement learning based routing protocol (RL-QRP) algorithm, and with a non-learning algorithm called the threshold scheme. Simulations were conducted under

7 different mobility, malicious node, and end-to-end delay requirement conditions. The routing performance metrics studied in this research were the average success ratio, the average end-to-end delay, and the number of discovered paths for each path length.

The experimental results showed that the proposed QRT algorithm can outperform the existing RL-QRP algorithm and the threshold scheme in terms of the average success ratio by up to 11% and 25%, respectively, in the malicious node variation case, and by up to 9% and 22%, respectively, in the node mobility variation case. Furthermore, in the end-to-end delay requirement case, QRT gained up to 11% over the RL-QRP algorithm. The results of our experiments suggest that trust and reputation can be applied to improve routing in the presence of malicious nodes in mWSNs for applications with stringent end-to-end delay requirements.

School of Telecommunication Engineering    Student's Signature
Academic Year 2012    Advisor's Signature

8 ACKNOWLEDGEMENT

I am grateful to all those who, by their direct or indirect involvement, have helped in the completion of this thesis.

First and foremost, I would like to express my sincere thanks to my thesis advisor, Asst. Prof. Dr. Wipawee Hattagam, for her invaluable help and constant encouragement throughout the course of this research. I am most grateful for her teaching and advice, not only on research methodologies but also on many other matters in life. I would not have achieved this much, and this thesis would not have been completed, without all the support that I have always received from her. In addition, I am grateful to the lecturers in the School of Telecommunication Engineering for their suggestions and all their help.

I would also like to express my thanks to Dr. Kae Hsiang Kwong, a senior research fellow at the University of Strathclyde, Scotland, for granting me the opportunity to do research in Scotland. I would also like to thank Asst. Prof. Dr. Peerapong Uthansakul and Asst. Prof. Dr. Paramate Horkaew for agreeing to serve on my committee.

My sincere gratitude goes to the Telecommunication Research Industrial and Development Institute (TRIDI), National Telecommunication Commission Fund, Thailand, for the scholarship throughout my studies and for the fruitful discussions and insights received from all the progress update meetings. My sincere appreciation goes to Ms. Pranitta Arthans for her valuable administrative support during the course of my dissertation.

9 Finally, I am most grateful to my parents and to my friends in both the master's and doctoral degree courses for all their support throughout the period of this research. Yanee Naputta

10 TABLE OF CONTENTS
ABSTRACT (THAI) I
ABSTRACT (ENGLISH) III
ACKNOWLEDGEMENTS V
TABLE OF CONTENTS VII
LIST OF TABLES XI
LIST OF FIGURES XII
SYMBOLS AND ABBREVIATIONS XIV
CHAPTER
I INTRODUCTION
  Significance of the Problem
  Research Objectives
  Research Hypothesis
  Basic Agreements
  Scope and Limitation
  Research Methodology
    Progressions
    Research Methodology
    Research Location
    Research Equipment 8

11 TABLE OF CONTENTS (Continued)
    Data Collection
    Data Analysis
  Expected Benefit
  Organization of Thesis 9
II BACKGROUND THEORY
  Introduction
  Markov Decision Process Theory
    Markov Property
    Markov Decision Process
    Policy
  Reinforcement Learning
    The Value Function
    The Optimal Value Function
  Q-learning
    Exploration
  Trust and Reputation
    Representation and Update: Binary Ratings
    Reputation and Update: Interval Rating
    Trust
  Summary 27

12 TABLE OF CONTENTS (Continued)
III RL-BASED ROUTING IN BIOMEDICAL MOBILE WIRELESS SENSOR NETWORKS USING TRUST AND REPUTATION
  Introduction
  Reinforcement Learning based Routing Protocol with QoS Support for Biomedical Sensor Networks (RL-QRP)
  Reputation
  RL-QRP with Trust and Reputation
  Performance Evaluation
    Unconstrained Traffic Demand
      Part 1 Malicious Nodes Effect
      Part 2 Mobility Effect
    Traffic Demand with End-to-End Delay QoS
  Conclusion 50
IV CONCLUSION AND FUTURE WORK
  Conclusion
    QRT
    Quality-of-Service
  Future Work
    mWSNs with Indirect Reputation Value
    Traffic Priority
    Performance Evaluation of Test Bed 55

13 TABLE OF CONTENTS (Continued)
    mWSNs with Energy Consumption Condition 55
REFERENCES 56
APPENDIX A PUBLICATION 61
BIOGRAPHY 68

14 LIST OF TABLES
3.1 QRT Routing Algorithm
3.2 Simulation Parameters
3.3 Simulation Parameters 44

15 LIST OF FIGURES
2.1 A MDP model
2.2 Diagram of agent-environment interaction in reinforcement learning
3.1 RL-QRP routing model
3.2 Average success ratio of discovered paths
3.3 Average end-to-end delay of discovered paths
3.4 Number of discovered paths length for 9 malicious nodes
3.5 Average success ratio under various degrees of mobility
3.6 Average end-to-end delay of discovered paths under various degrees of mobility
3.7 Average number of discovered path length under various degrees of mobility
3.8 Average success ratio under different end-to-end delay requirements and probability of malicious node =
3.9 Average success ratio under different end-to-end delay requirements and probability of malicious node =
3.10 Average end-to-end delay under different end-to-end delay requirements and probability of malicious node =
3.11 Average end-to-end delay under different end-to-end delay requirements and probability of malicious node =

16 LIST OF FIGURES (Continued)
3.12 Number of discovered path under different end-to-end delay requirements = 100 msec and probability of malicious node =
3.13 Number of discovered path under different end-to-end delay requirements = 100 msec and probability of malicious node =
3.14 Number of discovered path under different end-to-end delay requirements = 200 msec and probability of malicious node =
3.15 Number of discovered path under different end-to-end delay requirements = 200 msec and probability of malicious node =

17 SYMBOLS AND ABBREVIATIONS
WSNs = Wireless sensor networks
ECG = Electrocardiogram
mWSN = Mobile wireless sensor network
MAC = Media access control
GPS = Global positioning system
RL = Reinforcement learning
QoS = Quality-of-service
RFSN = Reputation-based framework for sensor networks
RL-QRP = A reinforcement learning based routing protocol with QoS support for biomedical sensor networks
MDP = Markov decision process
C = Criticality of the routing device
t = Time step index
α = Learning rate
S_t = State of the process at time t
S = State space
s = Current state
s' = Next state
A = Action space
a = Action

18 SYMBOLS AND ABBREVIATIONS (Continued)
E[.] = Expectation operator
γ = Discount factor
R(s, a, s') = Expected reward given current state s and action a, with next state s'
r = Reward
π = Policy
π* = Optimal policy
P[A] = Distribution over the action space
Q_t^π(s, a) = Action-value function of a given policy π for the state-action pair (s, a) at time t
R_t = Expected discounted return of the agent at time t
E_π[.] = Expectation operator under policy π
V^π(s) = Value function of a state s under policy π
V*(s) = Value function of a state s under the optimal policy
Q*(s, a) = Action-value function of the optimal policy for the state-action pair (s, a)
i = Class of message
θ = Reputation value
p(θ) = Prior distribution
Γ(.) = Gamma function

19 SYMBOLS AND ABBREVIATIONS (Continued)
D(δ) = Dirichlet process
δ = Base measure
T_ij = Trust metric
R_ij = Reputation metric
Q(s, a) = Quality of action a at state s
γ = Discount factor
Q(s', a') = Expected future reward at state s' obtained by taking action a'
D(s_i, s_sink) = Distance between node s_i and the destination node
D(s_j, s_sink) = Distance between node s_j and the destination node
D(s_i, s_j) = Distance between node s_i and node s_j
T_Q = End-to-end delay requirement
T_delay(s_i, s_j) = Experienced delay between node s_i and node s_j
N = Number of sensor nodes
p = Number of success events
n = Number of failure events
l_ij = Level of trust at node s_j experienced by s_i
r = Reward function

20 CHAPTER I
INTRODUCTION

This chapter introduces a background on routing problems in biomedical mobile wireless sensor networks and highlights the significance of improving routing performance in such networks. It also presents the motivation for applying trust and reputation with reinforcement learning to provide a good routing solution, which is the main focus of this thesis.

1.1 Significance of the Problem

A wireless sensor network (WSN) is a network of small devices, called sensor nodes, that are embedded in the real world to collect measurements of interest, e.g., humidity in the air, soil moisture, ambient temperature, pH, etc. There are numerous applications for wireless sensor networks, e.g., battlefield surveillance, medical care, wildlife monitoring and disaster response. In this research, we are interested in biomedical wireless sensor networks, which measure vital sign parameters such as body temperature, blood pressure, electrocardiogram (ECG), pulse oximetry and heart rate. These parameters are sensed at a patient and transmitted to a base station at a medical center. The data is used for health status monitoring, diagnosis, treatment and further analysis. For example, Varshney (2008) and Jovanov (2009) proposed the use of wireless sensors to monitor vital signs of patients in a hospital environment.

21 In medical sensor networks used for monitoring disabled/elderly patients, sensor nodes are attached to a patient's body to collect physiological information. In case of emergency, patients may be moved to an emergency room, or disabled/elderly patients may be on the move in the hospital, and medical staff may want to know their information continuously. Therefore, a mobile wireless sensor network (mWSN) system is necessary for biomedical sensor networks. Ying Hong Wang (2008) and Nguyen, Defago, Beuran and Shinoda (2008) conducted initial studies on the overall network lifetime in mWSNs. Mobility can further aggravate delay problems: as current paths become disconnected, new paths must be found to replace them.

Most of the fundamental characteristics of mobile wireless sensor networks are the same as those of normal static WSNs. Some major differences, however, are as follows.

1) Due to mobility, mobile WSNs have a much more dynamic topology compared to static WSNs. It is often assumed that a sink will move continuously in a random fashion, thus making the whole network dynamic.

2) It can be reasonably assumed that a gateway sink has unlimited energy, computation and storage resources. The depleted batteries of mobile sinks can be recharged or replaced with fresh ones, and mobile sinks have access to computational and storage devices.

3) The increased mobility in the case of mobile WSNs imposes some restrictions on the already proposed routing and MAC level protocols for WSNs (Zhou, Xing, and Yu, 2006). Most of the protocols designed for static WSNs perform poorly in the case of mWSNs.

22 4) Due to the dynamic topology of mWSNs, communication links can often become unreliable. This can be aggravated even further in hostile or remote areas where the availability of constant communication channels is low.

5) Because of the mobility, location estimation plays an important role in maintaining accurate knowledge of the location of the sinks or nodes. The location of the sinks or nodes can be obtained from GPS (Kim and Hong, 2009; Yadav, Mishra, and Gore, 2009; Kim, Lee, Yoon and Han, 2009).

From the aforementioned works, the design of mobile routing is a significant and challenging field. Nowadays, however, there is little research on routing in mWSNs. A routing technique which is suitable for mWSNs (Xuedong, Balasingham, and Byun, 2008) applies reinforcement learning, a distributed, self-adaptive, lightweight mechanism, to determine paths in a hop-by-hop manner.

Reinforcement learning (RL) is a technique used to support routing in dynamic topology networks. RL is the study of how animals and artificial systems can learn to optimize their behavior by using their experience through rewards and punishments. RL algorithms have been developed to approximate solutions to sequential optimal control problems. In the standard reinforcement learning model, an agent is connected to its environment via state perception and action (Kaelbling, Littman, and Moore, 1996). There are some works which applied RL to solve routing problems in static WSNs (Karaki and Kamal, 2004; Aghaei, Rahman, and Saddik, 2007; Forster and Murphy, 2007; 2008; Wang, 2006; Dong, Agrawal, and Sivalingam, 2007). Apart from routing, some studies (Seah, Tham, Srinivasan, and Xin, 2007; Renaud and Tham, 2006) used RL to solve coverage problems in static WSNs. Xuedong, Balasingham, and Byun (2008) proposed a QoS routing scheme in mobile wireless

23 sensor networks for biomedical sensor networks. In their research, they investigated the impact of network traffic load and sensor node mobility on the network performance. However, they considered cooperative mWSNs. As mentioned above, a more realistic scenario would require consideration of situations in which some nodes do not cooperate with others.

Most routing or packet forwarding schemes in the previous literature assume that nodes function properly and are trustworthy and cooperative. However, in realistic scenarios, nodes may fail to cooperate in the network due to node battery depletion, malfunctioning or simply misbehaving for unknown reasons. The most important task of biomedical sensor networks is to ensure that data is delivered to the medical center or the destination node. Reputation and trust systems have proven to be useful for detecting misbehaving nodes (faulty or malicious) and for assisting the decision-making process. Reputation systems have been widely studied in the context of several diverse domains; systems such as eBay (Resnick and Zeckhauser, 2000), Yahoo auctions (Resnick et al., 2000), and Internet-based systems such as Keynote (Blaze et al., 1996) maintain reputation metrics at a centralized trusted authority. Some research designed reputation systems for ad-hoc networks, e.g., Confidant (Buchegger and Boudec, 2002) and Core (Michiardi and Molva, 2002). These systems are distributed and also maintain a statistical representation by borrowing tools from the realm of game theory. These systems try to counter selfish routing misbehavior of nodes by enforcing nodes to cooperate with each other. More recently, reputation systems were proposed in the domain of ad-hoc networks that formulate the problem based on Bayesian analysis rather than game theory (Buchegger and Boudec, 2003a, 2003b). These systems can counter any arbitrary misbehavior of nodes. There are some works in the area of reputation and trust

24 systems for WSNs (Ganerial and Srivastava, 2004; Chen, 2007). In their schemes, a sensor node continuously builds a reputation value for other nodes by monitoring their behavior. Then the sensor node uses this reputation value to evaluate the trustworthiness of other nodes. Tanachaiwiwat, Dave, Bhindwale and Helmy (2003) proposed a mechanism of location-centric isolation of misbehavior and trust routing in energy-constrained sensor networks. In their trust model, the trustworthiness value is derived from the capacity of cryptography availability and packet forwarding. Ganerial and Srivastava (2004) proposed a reputation based framework for sensor networks (RFSN) based on beliefs (Josang and Knapskog, 1998) in order to derive reputation values, where each sensor node develops a reputation for each other node by making direct observations about these other nodes in the neighborhood. Reputation is represented through a Bayesian formulation, more specifically a beta reputation system, and is used to help a node evaluate the trustworthiness of other sensor nodes and then make decisions within the network. Furthermore, the statistical foundations of the RFSN algorithm can be reduced to a few basic mathematical operations of addition, subtraction, multiplication and division. So, RFSN can run on resource-constrained devices and is available as a middleware service on Motes.

For these reasons, this research aims to handle routing in non-cooperative biomedical mWSNs using a scalable routing mechanism for mWSNs, namely a reinforcement learning scheme, integrated with a reputation and trust system for detecting and screening out malicious node behavior in mWSNs. We also study the effects of mobility, the quantity of malicious nodes and quality-of-service requirements. We finally propose a good routing strategy for mWSNs which can handle mobility, malicious nodes and end-to-end delay requirement conditions.

25 1.2 Research Objectives
1. To study the effects of the RL algorithm on the routing performance in mWSNs.
2. To apply reputation and trust systems to solve the routing problem in mWSNs and compare with the existing routing algorithm.
3. To study the performance of QoS routing in mWSNs.

1.3 Research Hypothesis
1. RL can provide a good routing solution in mWSNs.
2. Some sensor nodes are uncooperative due to various reasons such as node battery depletion, malfunctioning or simply misbehaving for unknown reasons.
3. Reputation and trust can avoid misbehaving nodes in mWSNs.

1.4 Basic Agreements
1. Visual C++ was used to simulate the routing protocols in mWSNs.
2. Some data in the experiments were normalized to facilitate analysis and obtain conclusions.

1.5 Scope and Limitation
1. RL methods were studied to find a good routing strategy in mWSNs.
2. Reputation and trust were studied and applied to the RL algorithm in mWSNs. Results were compared with the existing RL-QRP algorithm.
3. Simulations were carried out in Visual C++. The experimental results were analyzed to find a suitable routing strategy for biomedical mWSNs.

26 1.6 Research Methodology

Progressions
1. Review of literature and related theories.
2. Study the existing routing methodologies in mWSNs and their performance.
3. Test the proposed reputation and trust systems with the RL algorithm by simulation using Visual C++ to solve routing problems in mWSNs.
4. Analyze and conclude results.
5. Prepare publication.
6. Write thesis.

Research Methodology
Objective 1: To study routing problems in mWSNs.
1. Review literature and related works about routing in mWSNs.
2. Determine the advantages and disadvantages of the routing methods chosen as benchmarks for this thesis.
3. Apply simulation tools such as Visual C++ to evaluate routing in mWSNs under various scenarios.
4. Design experiment scenarios to evaluate an existing routing algorithm (Xuedong, Balasingham, and Byun, 2008) which used a reinforcement learning method called RL-QRP to find the route.
5. Under various network scenarios, measure the following parameters to evaluate the performance of RL-QRP: the average success ratio, the average end-to-end delay and the number of discovered paths for each path length.

27 Objective 2: To apply reputation and trust systems with RL-QRP to solve the misbehaving-node routing problem in mWSNs and compare with the original RL-QRP.
1. Survey reputation and trust methods.
2. Add malicious nodes into the RL-QRP algorithm.
3. Apply the reputation and trust method to the RL-QRP algorithm.
4. Compare the results with the original RL-QRP algorithm by considering the following parameters: the average success ratio, the average end-to-end delay and the number of discovered paths for each path length.
5. Add a QoS condition in terms of the end-to-end delay requirement to the network and compare the results of the QRT and RL-QRP algorithms by considering the following parameters: the average success ratio, the average end-to-end delay and the number of discovered paths under different end-to-end delay requirements.

Research Location
1. Wireless Communication Research and Laboratory, Factory Building 4 (F4), 111 University Avenue, Muang District, Nakhon Ratchasima 30000, Thailand.
2. Centre for Dynamic Intelligent Communications (CIDCOM) within the Department of Electronic and Electrical Engineering, Strathclyde University, Royal College Building, 204 George Street, Glasgow G1 1XW, Scotland.

Research Equipment
1. Personal computer
2. Visual C++ software

28 Data Collection
1. Information collected by reviewing literature and related works.
2. Data collected from Visual C++ simulations.

Data Analysis
Data collected from the sensor node simulations were analyzed, compared and summarized in terms of graphs and tables.

1.7 Expected Benefit
1. A suitable routing strategy for mWSNs which contain misbehaving nodes.
2. Improved routing reliability in mWSNs.

1.8 Organization of Thesis
The remainder of this thesis is organized as follows. Chapter 2 presents the theoretical background which underlies the contribution of this thesis: first an introduction to related works, followed by an introduction to Markov decision process theory, reinforcement learning (RL) and Q-learning; finally, the basic theory of reputation and trust, which is integrated with the RL process to enhance routing in mWSNs containing malicious nodes, is presented. In the first part of Chapter 3, we study the existing RL-QRP algorithm and formulate reputation and trust to evaluate the routing performance in mWSNs under various mobility and malicious node conditions. The proposed algorithm, which integrates RL-QRP with reputation and trust and is called QRT, and the original RL-QRP were compared in terms of the average success ratio and the average end-to-end delay. The routing performance results were evaluated and compared between the RL-QRP and QRT algorithms under different conditions of malicious node behavior, mobility and end-to-end delay requirements.

29 Chapter 4 summarizes all findings and original contributions of this thesis and points out possible future research directions.

30 CHAPTER II
BACKGROUND THEORY

2.1 Introduction

This thesis proposes a reinforcement learning based routing mechanism for biomedical mobile wireless sensor networks using trust and reputation. A wireless sensor network (WSN) is a network of small devices, called sensor nodes, that are embedded in the real world to collect measurements of interest. There are numerous applications for wireless sensor networks, e.g., battlefield surveillance, medical care, wildlife monitoring and disaster response. In this research, we are interested in biomedical wireless sensor networks, in which parameters such as body temperature, blood pressure, electrocardiogram (ECG), pulse oximetry (SpO2) and heart rate are sensed at a patient and transmitted to a base station at a medical center. The main function of biomedical sensor networks is to ensure that data packets can be sensed and delivered to the medical center reliably and efficiently. Thus, the routing protocol plays an important role in the communication stack and has a significant impact on the network performance. However, some sensor nodes may not cooperate with each other. Nodes may drop packets they receive due to node battery depletion, malfunctioning or simply misbehaving for unknown reasons. Therefore, the main focus of this thesis is to solve the routing problem for non-cooperative mWSNs based on RL by incorporating a reputation and trust mechanism.

31 Reinforcement learning (Sutton and Barto, 1998) is the study of how animals or machines can learn to optimize their behavior to obtain rewards and to avoid punishments. This learning scheme permits a decision maker to learn its optimal decisions (actions) through a series of trial-and-error interactions with a dynamic environment. Its main idea is to reinforce good behaviors of the decision maker while discouraging bad behaviors, through a scalar reward value returned by the environment. RL relies on the assumption that the dynamics of the system satisfy a Markov decision process (MDP). Q-learning (Watkins, 1989) is a reinforcement learning technique that approximates the optimal action-value function, which is a function that gives the expected reward for taking a given action in a given state and following a fixed policy thereafter. One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment.

Reputation and trust systems are widely used in diverse domains, for example e-commerce systems such as eBay (Resnick and Zeckhauser, 2000) and Yahoo auctions (Resnick et al., 2000). These systems try to counter selfish routing misbehavior of nodes by enforcing nodes to cooperate with each other. Therefore, this chapter introduces the basic theory of reputation and trust systems and the theory behind reinforcement learning. It also serves as an introduction to the Q-learning algorithm, which is the basis of this thesis. The next section provides a background theory of the Markov decision process (MDP), followed by reinforcement learning (RL) and the reputation and trust process. A summary is presented in the final section.

32 2.2 Markov Decision Process Theory

A Markov decision process (MDP) is a model of a decision-maker interacting synchronously with the environment. Since the decision-maker sees the environment's true state, it is referred to as a completely observable Markov decision process. The basics of Markov decision processes are presented as follows.

2.2.1 Markov Property

The Markov property refers to the memoryless property of a stochastic process. A stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state, not on the sequence of events that preceded it. A process with this property is called a Markov process. The Markov property states that anything that has happened so far can be summarized by the current state S_t. Therefore, the probability of being in the next state at time t+1, based on the past history of state changes, can be defined simply as the conditional probability based on the current state at time t:

P(S_{t+1} = s_{t+1} | S_t = s_t, ..., S_0 = s_0) = P(S_{t+1} = s_{t+1} | S_t = s_t).   (2.1)

This equation is referred to as the Markov property. In other words, a stochastic process has the Markov property if the probability distribution of future states of the process at time t+1, given the present state at time t and all past states, depends only upon the present state and not on any past states.

33 2.2.2 Markov Decision Process

The probability that the process chooses s' as its new state is influenced by the chosen action. Specifically, it is given by the state transition probability function. Thus, the next state s' depends on the current state s and the decision-maker's action a. But given s and a, it is conditionally independent of all previous states and actions. In other words, the state transitions of an MDP possess the Markov property. This state transition probability function is defined by

P(s' | s, a) = P(S_{t+1} = s' | S_t = s, a_t = a).   (2.2)

Similarly, given any current state and action, s and a, together with any next state, s', the expected value of the incurred reward is

R(s, a, s') = E[r_{t+1} | S_t = s, a_t = a, S_{t+1} = s'],   (2.3)

where E[.] is the expectation operator and r_{t+1} is the reward received at time t+1. Equations (2.2) and (2.3) completely specify the most important aspects of the dynamics of the MDP. The simulation program requires exact knowledge of these two functions in order to determine the optimal policy. An MDP model is shown in Figure 2.1.

Figure 2.1 A MDP model.

34 A Markov decision process is a 4-tuple (S, A, P, R) which describes the MDP characteristics, where S denotes the set of states, A is a finite set of actions, P is the probability that action a in state s at time t will lead to state s' at time t+1, and R is the immediate reward (or expected immediate reward) received after the transition to state s' from state s after having taken action a ∈ A. Let P(s' | s, a) ∈ P be the state transition model that denotes the probability of transitioning to the next state s' ∈ S after an agent takes action a ∈ A in the current state s ∈ S.

2.2.3 Policy

A policy π is a description of the behavior of a decision-maker, or a function mapping states to actions, π : S → A. There are two types of policies. A stationary policy is a situation-action mapping, i.e., it specifies an action to be taken at each state. The choice of action depends only on the state and is independent of the time step. A non-stationary policy, on the other hand, is a sequence of situation-action mappings, indexed by time. In this thesis, we focus on stationary policies since our data acquisition problem is based on models of sensor readings which are obtained in a particular time frame, such as in the mornings, afternoons, etc. Hence, within such a period, the model may be considered stationary, and hence the policy is also assumed stationary.

The objective of solving an MDP is to find a policy π, defined as a mapping of the state space to the action space, π : S → P[A], where P[A] is the distribution over the action space. The action-value function Q_t^π(s, a) of a given policy π associates a state-action pair (s, a) with an expected reward for performing action a in state s at time step t under policy π.

35 To achieve this objective, particularly in scenarios where the dynamics of the environment are difficult to model (such as in mWSNs), a technique called reinforcement learning can be used to solve MDPs.

2.3 Reinforcement Learning

Reinforcement learning (RL) is a computational approach concerned with how an agent ought to take actions in an environment so as to maximize some notion of cumulative reward. In machine learning, the environment is typically formulated as a Markov decision process (MDP), and many reinforcement learning algorithms for this context are highly related to dynamic programming techniques. The main difference from these classical techniques is that reinforcement learning algorithms do not need knowledge of the MDP and they target large MDPs where exact methods become infeasible. The learner is not taught which action to take, as in most forms of machine learning, but instead must discover which actions yield the most reward through trial-and-error interactions with its environment (Sutton and Barto, 1998).

A reinforcement learning agent interacts with its environment in discrete time steps. At each time t, the agent receives an observation, which typically includes the reward r_t. It then chooses an action a_t from the set of available actions. The environment then moves to a new state s_{t+1} and the reward r_{t+1} associated with the transition (s_t, a_t, s_{t+1}) is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. Figure 2.2 shows the agent-environment interaction in reinforcement learning.

36 Figure 2.2 Diagram of agent-environment interaction in reinforcement learning.

2.3.1 The Value Function

Define the value function V^π(s) of a policy π by

V^π(s) = E_π[ R_t | S_t = s ] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | S_t = s ],   (2.4)

where R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1} is the expected discounted return of the agent, γ is the discount factor with 0 ≤ γ ≤ 1, and E_π[.] is the expectation operator under policy π. Similarly, the action-value function Q_t^π(s, a) of a given policy π associates a state-action pair (s, a) with an expected reward for performing action a in state s at time step t and following π thereafter:

Q_t^π(s, a) = E_π[ R_t | S_t = s, a_t = a ] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | S_t = s, a_t = a ].   (2.5)

37 2.3.2 The Optimal Value Function

Solving a reinforcement learning task means, roughly, finding a policy that achieves the maximum reward over the long run. The optimal value function, denoted V*(s), is defined as the maximum state value function over all possible policies at state s:

V*(s) = max_π V^π(s).   (2.6)

Optimal policies also share the same optimal action-value function, denoted Q*(s, a), defined by

Q*(s, a) = max_π Q^π(s, a).   (2.7)

The standard solution to the problem above is through an iterative search method (Puterman, 1994) that searches for a fixed point of the following Bellman equation:

V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V*(s') ].   (2.8)

Equation (2.8) is a form of the Bellman optimality equation for V*(s). The Bellman optimality equation for Q*(s, a) is

Q*(s, a) = R(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q*(s', a').   (2.9)
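To make the fixed-point computation of Equations (2.6)–(2.8) concrete, the following is a minimal C++ sketch of value iteration on a small MDP. The two-state transition probabilities and rewards are illustrative placeholders chosen here, not values from this thesis; the sketch only demonstrates how the Bellman optimality operator is applied repeatedly until V converges.

#include <algorithm>
#include <cmath>
#include <cstdio>

// Value iteration sketch: repeatedly apply the Bellman optimality
// operator of Equation (2.8) until the value function stops changing.
// The 2-state, 2-action MDP below is an illustrative placeholder.
int main() {
    const int S = 2, A = 2;
    const double gamma = 0.9, tol = 1e-6;
    // P[s][a][s'] : transition probabilities, R[s][a] : expected rewards.
    double P[S][A][S] = {{{0.8, 0.2}, {0.1, 0.9}},
                         {{0.5, 0.5}, {0.0, 1.0}}};
    double R[S][A] = {{1.0, 0.0},
                      {2.0, 0.5}};
    double V[S] = {0.0, 0.0};

    double delta;
    do {
        delta = 0.0;
        for (int s = 0; s < S; ++s) {
            double best = -1e9;
            for (int a = 0; a < A; ++a) {
                double q = R[s][a];                        // immediate reward
                for (int s2 = 0; s2 < S; ++s2)
                    q += gamma * P[s][a][s2] * V[s2];      // discounted future value
                best = std::max(best, q);
            }
            delta = std::max(delta, std::fabs(best - V[s]));
            V[s] = best;                                   // in-place (asynchronous) sweep
        }
    } while (delta > tol);

    std::printf("V*(0) = %.3f, V*(1) = %.3f\n", V[0], V[1]);
    return 0;
}

Extracting arg max over actions from the converged values yields the optimal policy of Equation (2.12) further below.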

38 2.4 Q-learning

Q-learning is a reinforcement learning technique that works by learning an action-value function that gives the expected utility of taking a given action in a given state and following a fixed policy thereafter. One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. Q-learning (Sutton and Barto, 1998) defines a learning method within an MDP that is employed in single-agent RL systems. Q-learning is an algorithm that does not need a model of the environment and can directly approximate the optimal action-value function (Q-value) through online learning.

Assume that the learning agent exists in an environment described by some set of possible states s ∈ S. It can perform any of the possible actions a ∈ A. The interaction between the agent and the environment at each instant consists of the following sequence:
- The agent senses the state s_t ∈ S.
- Based on s_t, the agent performs an action a_t ∈ A.
- As a result, the environment makes a transition to the new state s' = s_{t+1} ∈ S.
- The agent receives a real-valued reward (payoff) r_t that indicates the immediate reward value of this state-action transition.

The task of the agent is to learn a policy π : S → A for selecting its next action a_t = π(s_t) based only on the current state s_t. For a policy π, the Q-value Q^π(s, a) (or state-action value) is the expected discounted return for executing action a at state s and then following policy π thereafter. The optimal policy π*(s) is the policy that maximizes the total expected discounted reward received over an infinite horizon.

39 The Q-learning process tries to find Q(s, a) ≈ Q*(s, a) in a recursive manner using the available information (s_t, a_t, s', a', r_t), where s_t and s' are the states at time t and t+1, respectively, a_t and a' are the actions at time t and t+1, respectively, and r_t is the immediate reward due to a_t. The Q-learning rule at time step t+1 is given by

Q_{t+1}(s_t, a_t) = (1 − α) Q_t(s_t, a_t) + α [ r_t + γ max_{a'} Q_t(s', a') ],   (2.10)

where 0 ≤ γ ≤ 1 is the discount factor, 0 ≤ α ≤ 1 is the learning rate, and Q_t(s', a') is the action-value function for the next state s' and next action a'.

2.4.1 Exploration

One of the most important issues for the Q-learning algorithm is maintaining a balance between exploration and exploitation. Normally, the convergence theorem of Q-learning requires that all state-action pairs (s, a) are tried infinitely often (Sutton and Barto, 1998). Such a balanced condition is satisfied by selecting the greedy action with some probability and exploring new actions otherwise, where the greedy action is

a* = arg max_{a ∈ A} Q(s, a).   (2.11)

This selection rule, termed ε-greedy, significantly speeds up the convergence of the Q-value function. If the Q-value of each admissible (s, a) pair is visited infinitely often, and if the learning rate is decreased to zero in a suitable way, then as t → ∞, Q_t(s, a) converges to Q*(s, a) with probability 1 (Sutton and Barto, 1998). The optimal policy is defined by

π*(s) = arg max_{a ∈ A(s)} Q*(s, a).   (2.12)
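The update rule (2.10) together with ε-greedy selection (2.11) can be written compactly in C++. The sketch below uses a hypothetical toy environment (a five-state chain with a reward at one end) purely so that the code is self-contained; the function and parameter names are assumptions made here for illustration and are not part of the thesis simulator.

#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>

// Tabular Q-learning with epsilon-greedy exploration, following
// Equations (2.10) and (2.11).
struct Step { int nextState; double reward; };

// Toy environment: 5 states in a line, action 0 moves left, action 1
// moves right; reaching state 4 yields a reward of 1.
Step step(int s, int a) {
    int next = (a == 1) ? std::min(s + 1, 4) : std::max(s - 1, 0);
    return { next, next == 4 ? 1.0 : 0.0 };
}

// Epsilon-greedy action selection: explore with probability eps,
// otherwise take the greedy action of Equation (2.11).
int chooseAction(const std::vector<std::vector<double>>& Q, int s, double eps) {
    int A = (int)Q[s].size();
    if ((double)std::rand() / RAND_MAX < eps)
        return std::rand() % A;
    int best = 0;
    for (int a = 1; a < A; ++a)
        if (Q[s][a] > Q[s][best]) best = a;
    return best;
}

int main() {
    const int S = 5, A = 2;
    const double alpha = 0.1, gamma = 0.9, eps = 0.1;
    std::vector<std::vector<double>> Q(S, std::vector<double>(A, 0.0));
    std::srand((unsigned)std::time(nullptr));

    for (int episode = 0; episode < 200; ++episode) {
        int s = 0;
        for (int t = 0; t < 100 && s != 4; ++t) {
            int a = chooseAction(Q, s, eps);
            Step st = step(s, a);
            double maxNext = *std::max_element(Q[st.nextState].begin(),
                                               Q[st.nextState].end());
            // One-step Q-learning update, Equation (2.10).
            Q[s][a] = (1.0 - alpha) * Q[s][a]
                    + alpha * (st.reward + gamma * maxNext);
            s = st.nextState;
        }
    }
    std::printf("Q(0, right) = %.3f\n", Q[0][1]);
    return 0;
}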

40 2.5 Trust and Reputation

In this section, we describe techniques for estimating a reputation θ based on transactional data. A transaction occurs whenever two nodes make an exchange of information or participate in a collaborative process. With each exchange, the nodes generate ratings indicating the degree of cooperation of their partner node. For the moment, we consider the reputation θ as representing the probability that a given node will cooperate when asked to exchange information. Therefore, our reputations θ are contained in the unit interval [0, 1], and values of θ closer to one suggest greater cooperation. In the next two sections, we discuss a Bayesian framework for updating reputations given the rating from each new transaction. Within this section we address the following topics: the representation of reputation, its update with new transactions, and a trust metric as the output of the reputation.

2.5.1 Representation and Update: Binary Ratings

Suppose a transaction occurs between nodes i and j. Depending on the outcome, node i will assign the value 1 if node j was cooperative and 0 otherwise. Node i will then update its reputation for node j, incorporating this new data. Independently, node j will create its own rating for the exchange and update its opinion of node i. For simplicity, we will focus on the computations carried out by node i, with the understanding that each node in the network will perform similar operations after it completes a transaction.

41 Let θ denote the reputation of node j held by node i. We adopt a classical beta-binomial framework for estimating reputations (Gelman et al., 2003; Josang and Ismail, 2002). Specifically, we assign to θ a prior distribution p(θ) that reflects our uncertainty about the behavior of node j before any transactions with i take place. We will take p(θ) from the beta family, a two-parameter class of distributions which can be expressed as

p(θ) = [Γ(α + β) / (Γ(α) Γ(β))] θ^{α−1} (1 − θ)^{β−1}   (2.13)

for some choice of α and β, where Γ(.) is the gamma function (Gelman et al., 2003). The mean of a beta distribution with parameters (α, β) is α/(α + β) and its variance is αβ/[(α + β)^2 (α + β + 1)]. The beta is chosen, in part, because of its flexibility and its ability to peak at any value in the interval [0, 1] with arbitrarily small variance (Gelman et al., 2003).

Given θ, we then model our binary ratings as Bernoulli observations with success probability θ. That is, let x denote node i's rating of node j for a single transaction. Then, given j's reputation θ, the probability that node j will be cooperative is

p(x | θ) = θ^x (1 − θ)^{1−x}.   (2.14)

Once the transaction is complete, we update our reputation using the posterior distribution for θ:

p(θ | x) = p(x | θ) p(θ) / ∫ p(x | θ) p(θ) dθ.   (2.15)

42 In our case, these expressions become

p(θ | x) ∝ θ^{(α + x) − 1} (1 − θ)^{(β + 1 − x) − 1},   (2.16)

which means the posterior p(θ | x) again has a beta distribution, with parameters α + x and β + (1 − x). The utility of the choice of a beta distribution is now clear because of its relationship with the Bernoulli (binomial) distribution: the beta distribution is the conjugate prior for the Bernoulli distribution. Therefore, our reputation framework requires node i to maintain only two parameters to describe the reputation of node j, with very simple update rules as each new transaction occurs.

Suppose nodes i and j now conduct n transactions with ratings x_1, ..., x_n. Repeating the updates in the previous paragraph, we find that the posterior distribution for θ after n transactions is again beta, with parameters updated as follows:

α_n = α + Σ_{k=1}^{n} x_k,    β_n = β + n − Σ_{k=1}^{n} x_k.   (2.17)

Therefore, after n transactions, the posterior mean of θ is

E[θ | x_1, ..., x_n] = w_n (α / (α + β)) + (1 − w_n) (1/n) Σ_{k=1}^{n} x_k,   (2.18)

where w_n = (α + β) / (α + β + n) is a weight that tends to zero as n → ∞. This form of the update shows clearly that we are doing a weighted average of the prior mean and the mean of the new observations. The weight on the prior mean goes to zero as the number of new observations grows very large.
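Because the entire Bayesian update reduces to the bookkeeping of Equation (2.17), a reputation record only needs to store the two beta parameters. The C++ sketch below illustrates this; the class and member names are assumptions made here for illustration, and the prior α = β = 1 is chosen so that the initial trust value is 0.5, matching the value used later in Chapter III. The same rate() call also covers the interval-rating case of the next section, where x may be any value in [0, 1].

#include <cstdio>

// Minimal beta reputation bookkeeping in the spirit of Section 2.5.1:
// each rating x in [0,1] (1 = cooperative, 0 = uncooperative) simply
// increments the two beta parameters; trust is the posterior mean.
struct BetaReputation {
    double alpha = 1.0;   // prior pseudo-count of cooperative outcomes
    double beta  = 1.0;   // prior pseudo-count of uncooperative outcomes

    // Update per Equation (2.17)/(2.21); works for binary or interval ratings.
    void rate(double x) { alpha += x; beta += 1.0 - x; }

    // Posterior mean of theta, i.e. the trust value.
    double trust() const { return alpha / (alpha + beta); }
};

int main() {
    BetaReputation rep;                      // trust starts at 0.5
    double ratings[] = {1, 1, 0, 1, 0.7};    // observed cooperativeness
    for (double x : ratings) rep.rate(x);
    std::printf("trust = %.3f\n", rep.trust());
    return 0;
}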

43 2.5.2 Reputation and Update: Interval Rating

Now we describe an update for ratings that are not measured on a binary scale but instead are assigned some value in [0, 1]. We can think of these ratings as estimated probabilities, perhaps for the event that a particular data point exchanged between i and j is faulty. Note that the notion of estimated probabilities is much more consistent than binary ratings. In this context, we appeal to a slightly more elaborate framework involving Dirichlet processes (Ferguson, 1973). Let D(δ) be a Dirichlet process with base measure δ and let this be our prior distribution. Given observations x_1, ..., x_n, Ferguson (1973) tells us that the posterior is again a Dirichlet process, with base measure δ + Σ_{k=1}^{n} I_{x_k}, where I_{x_k} is an indicator of a point mass at the location of the observation. As we will describe in Section 2.5.3, we are ultimately interested in the posterior trust, i.e., the posterior mean of the reputation distribution. When the prior mean is given by m_δ, the posterior mean of the Dirichlet process is given by

E[θ | x_1, ..., x_n] = w_n m_δ + (1 − w_n) (1/n) Σ_{k=1}^{n} x_k,   (2.19)

where w_n = δ([0, 1]) / (δ([0, 1]) + n) tends to zero as n → ∞ and m_δ is the mean of the normalized base measure. Suppose we take δ([0, 1]) = α + β. Then we have

44 w_n = (α + β) / (α + β + n),   (2.20)

which, even though we are now dealing with real-valued observations on the interval [0, 1], gives the same weights as in Section 2.5.1, where we had binary cooperativeness ratings. In fact, in order to match not just the weights but also the prior mean, we could take our base measure to have total mass α + β and mean α/(α + β), and get exactly the same updating as in Equation (2.18) with real-valued variables instead of binary variables. Once we have seen that the update has this generalizable form using the Dirichlet process, we can also see that the update using binary ratings in Section 2.5.1 can be derived within this framework. If we let the base measure place its mass only on the points {0, 1}, which would suggest our data are binary, then the update for the mean is again exactly Equation (2.18). We can now see that this justification is a very general one.

Following from this discussion, in order to maintain our two parameters in a way that correctly updates the posterior mean, we replace the Bayesian update step with an identical bookkeeping step. After a single transaction, if the assigned probability of cooperativeness were x ∈ [0, 1], the beta parameter updates would be

α ← α + x,    β ← β + (1 − x).   (2.21)

2.5.3 Trust

The main objective of the reputation block is to expose as output a metric that can be used as a representation of the subjective expectation of the other node's future behavior. Up until now we have represented i's reputation of node j

45 with θ, but from here on we represent it with R_ij to make the pairwise reputations more explicit. Given a reputation metric R_ij, we define the trust metric T_ij as node i's prediction of the expected future behavior of node j. T_ij is obtained by taking a statistical expectation of this prediction:

T_ij = E[R_ij] = α / (α + β),   (2.22)

where α and β are the current beta parameters that node i maintains for node j. This trust metric can be used by a node in several ways. Some notable ones are:

(1) Data fusion: T_ij can be used as a weight for a data reading reported by node j. Data fusion can then be performed on these weighted data readings, thereby reducing the impact of untrustworthy nodes.

(2) Node revocation: The evolution of trust over time provides an on-line tool for the end-user to detect compromised or faulty nodes. This can help the end-user to take appropriate countermeasures, such as replacing the misbehaving node or sensor.

(3) Decentralized decision making: In a heterogeneous sensor network, different nodes might be equipped with different capabilities. For example, a few of them might have a more precise temperature sensor or a camera, others may be mobile, etc. Given a requirement to use a particular service from some other node in the network and faced with multiple choices, the value of T_ij can be used as a decision-making criterion.
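As a small illustration of use (1), the C++ sketch below fuses neighbour readings weighted by their trust values; the structure and the sample numbers are invented here purely for illustration and are not taken from the thesis.

#include <cstdio>
#include <vector>

// Illustrative use of the trust metric T_ij as a data-fusion weight:
// readings reported by neighbours are averaged with weights proportional
// to their trust, so low-trust reports contribute little.
struct Report { double reading; double trust; };

double fuse(const std::vector<Report>& reports) {
    double num = 0.0, den = 0.0;
    for (const Report& r : reports) {
        num += r.trust * r.reading;   // weight each reading by T_ij
        den += r.trust;
    }
    return den > 0.0 ? num / den : 0.0;
}

int main() {
    std::vector<Report> reports = {{36.8, 0.9}, {37.0, 0.8}, {41.5, 0.1}};
    std::printf("fused reading = %.2f\n", fuse(reports));  // low-trust outlier damped
    return 0;
}

The same trust value could equally drive the node-revocation or decentralized decision-making uses listed above.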

46 2.6 Summary

In this chapter, an overview of Q-learning, which is a reinforcement learning method, has been given. We provided a concise background on theories related to reinforcement learning, including the Markov decision process. Furthermore, we also presented an overview of reputation and trust systems. In the next chapter, a reinforcement learning based routing scheme for biomedical mobile wireless sensor networks using trust and reputation is presented and its routing performance is compared with an existing algorithm.

47 CHAPTER III
RL-BASED ROUTING IN BIOMEDICAL MOBILE WIRELESS SENSOR NETWORKS USING TRUST AND REPUTATION

3.1 Introduction

In this chapter, routing issues in biomedical wireless sensor networks are investigated. Parameters such as body temperature, blood pressure and heart rate are sensed at a patient and transmitted via intermediate sensor nodes to a base station at a medical center. The data is used for health status monitoring, diagnosis and treatment. For example, Z. Pang, Q. Chen, and L. Zheng (2009) and E. Jovanov, C. Poon, Y. Guang-Zhong, and Y.T. Zhang (2009) proposed the use of wireless sensors to monitor vital signs of patients in hospital and home environments. The most important task of biomedical sensor networks is to ensure that data can be delivered to the medical center reliably and efficiently (R.S.H. Istepanian, E. Jovanov, Y.T. Zhang, 2004). Furthermore, in biomedical sensor networks, patients may be moved to an emergency room, and medical staff may want to know their information continuously. Therefore, the use of a mobile wireless sensor network (mWSN) is necessary for biomedical sensor networks. Distributed, lightweight, and highly adaptive routing protocols based on methods such as reinforcement learning (RL) have been proposed for such rapidly changing wireless network conditions (E. Gelenbe and M. Gellman, 2007; L. Xuedong, I. Balasingham, and S.S. Byun, 2008).

48 RL is a technique that has been used to support routing in dynamic topology networks. RL is the study of how artificial systems can learn to optimize their behavior by using their experience through rewards and punishments. There are some works which applied RL to solve routing problems in static WSNs (A. Forster, A.L. Murphy, J. Schiller, and K. Terfloth, 2008). In (E. Gelenbe and M. Gellman, 2007), the authors proposed a Cognitive Packet Network (CPN) which made routing decisions in the presence of routing oscillations using RL and a neural network model. L. Xuedong, I. Balasingham, and S.S. Byun (2008) proposed RL-QRP, an RL-based routing protocol with a QoS routing scheme for mWSNs. They investigated the impact of network traffic load and sensor node mobility on the network performance. However, their results were based on the assumption that all nodes cooperated in the packet forwarding process. A more realistic scenario would require consideration of situations in which some nodes do not cooperate with each other (i.e., by dropping packets they receive), either due to node battery depletion, malfunctioning or simply misbehaving for unknown reasons (U. Varshney, 2008). Since in biomedical sensor networks data packets must be delivered to their destination node reliably, means to identify and avoid these malicious nodes are necessary (D. He, C. Chen, S. Chan, J. Bu, and A. Vasilakos, 2012). Reputation and trust schemes have been used to identify well-behaved and malicious nodes in WSNs (D. He, C. Chen, S. Chan, J. Bu, and A. Vasilakos, 2012; H. Yu, Z. Shen, C. Miao, C. Leung, and D. Niyato, 2010). In such schemes, a sensor node continuously builds a reputation value for other nodes by monitoring their behavior. Then the sensor node uses this reputation value to evaluate the trustworthiness of other nodes. D. He, C. Chen, S. Chan, J. Bu, and A. Vasilakos

49 (2012) proposed a trust scheme called ReTrust for medical WSNs which is lightweight and attack-resistant. High malicious node detection rates and average packet delivery ratios were achieved via simulation and an experimental test-bed. However, sensor node mobility was not explicitly addressed.

Therefore, the objective of this chapter is to solve the routing problem for non-cooperative mWSNs based on RL by incorporating a reputation and trust mechanism which screens out nodes with malicious behavior using reputation and trust values maintained at the sensor nodes. We compared its performance with an existing reinforcement learning routing scheme called RL-QRP (L. Xuedong, I. Balasingham, and S.S. Byun, 2008) under various mobility and malicious node scenarios.

3.2 RL-QRP

The Reinforcement Learning based Routing Protocol with QoS Support for Biomedical Sensor Networks (RL-QRP) has been proposed to learn routing policies that find optimal paths through experience and rewards (L. Xuedong, I. Balasingham, and S.S. Byun, 2008). It uses Q-learning, which learns the action-value function Q(s, a) to find an optimal decision policy. Each time an action is selected, the agent receives an immediate reward from the environment. The agent then uses this reward in the one-step update rule

Q(s_i, a_j) ← (1 − α) Q(s_i, a_j) + α [ r + γ Q(s_j, a') ],   (3.1)

where the Q-value Q(s_i, a_j) denotes the quality of action a_j at state s_i, α is the learning rate, γ is the discount factor, and Q(s_j, a') denotes the expected future reward at state s_j obtained by taking action a'.

50 The updated Q-values then in turn affect the future decisions of the agent. RL-QRP requires the use of location information parameters to calculate a reward following a particular action. Therefore, the protocol can find the shortest path from a source node to a destination node using a reward function of the form

r = (D(s_i, s_sink) − D(s_j, s_sink)) / D(s_i, s_j),   provided that T_delay(s_i, s_j) ≤ T_Q,   (3.2)

where D(s_i, s_sink) and D(s_j, s_sink) are the distances between node s_i and the destination node and between node s_j and the destination node, respectively, D(s_i, s_j) is the distance between node s_i and node s_j, T_Q is the end-to-end delay requirement encapsulated in the data packet, and T_delay(s_i, s_j) is the experienced delay between node s_i and s_j.

Figure 3.1 RL-QRP routing model

51 The basic idea of RL-QRP follows Figure 3.1. Each node in the biomedical sensor network is considered as a state belonging to the set S = {s_i}, i = 1, 2, ..., N, where N is the number of sensor nodes. For each node s_i with a neighbor s_j, an action can be selected from A = {a(s_i, s_j)}. Note that a(s_i, s_j) refers to a packet being forwarded from state s_i to s_j, provided that s_i and s_j are within each other's communication range. Suppose that a node s_i in Figure 3.1 must forward a packet to the sink node through some intermediate node. Node s_i then checks the Q-values of its neighboring nodes and forwards the packet to the neighbor with the highest Q-value; suppose this is node s_j. After that, node s_i updates its Q-value Q(s_i, a(s_i, s_j)) according to (3.1) with the reward in (3.2). The process is repeated at node s_j and the following consecutive nodes until the packet reaches the sink node. Thus, the nodes can find the optimal route through experience and rewards without complicated prediction techniques or frequent explicit updates. Therefore, this process is well-suited for dynamic topologies.
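The following C++ sketch captures the hop-by-hop behaviour just described: a node forwards to the neighbour with the highest Q-value and then updates that Q-value with the distance- and delay-based reward. The data structures, the zero reward in the violated-delay branch, and the exact reward form follow the reconstruction of Equations (3.1)–(3.2) above; they are illustrative assumptions, not the thesis simulation code.

#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the hop-by-hop RL-QRP decision: forward to the neighbour
// with the highest Q-value, then update that Q-value with the
// distance/delay based reward.
struct Node {
    double x = 0.0, y = 0.0;          // current position
    std::vector<int> neighbors;       // ids of nodes within radio range
    std::vector<double> q;            // Q-value per neighbour (same order)
};

double dist(const Node& a, const Node& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Index (into neighbors/q) of the neighbour with the highest Q-value.
int nextHop(const Node& ni) {
    if (ni.neighbors.empty()) return -1;
    std::size_t best = 0;
    for (std::size_t k = 1; k < ni.q.size(); ++k)
        if (ni.q[k] > ni.q[best]) best = k;
    return (int)best;
}

// Reward for forwarding from i to j: normalized distance progress toward
// the sink, granted when the experienced delay meets the requirement T_Q.
// The zero value for the violated-delay case is an assumption.
double reward(const Node& ni, const Node& nj, const Node& sink,
              double delay_ij, double T_Q) {
    if (delay_ij > T_Q) return 0.0;
    return (dist(ni, sink) - dist(nj, sink)) / dist(ni, nj);
}

// One-step update of node i's Q-value for its k-th neighbour, Eq. (3.1).
// nextQ is the expected future reward reported for the next node.
void updateQ(Node& ni, int k, double r, double nextQ,
             double alpha, double gamma) {
    ni.q[k] = (1.0 - alpha) * ni.q[k] + alpha * (r + gamma * nextQ);
}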

52 3.3 Reputation

Reputation and trust systems have proved to be useful mechanisms for addressing the threat of compromised or faulty entities. Such systems operate by identifying selfish peers and excluding these entities from the network. S. Buchegger and J.-Y. Le Boudec (2002) considered routing protocols in MANETs using both first-hand and second-hand information for updating reputation values. S. Ganeriwal and M. B. Srivastava (2004) and D. He, C. Chen, S. Chan, J. Bu, and A. Vasilakos (2012) considered both first- and second-hand reputation and trust-based models developed exclusively for sensor networks. In D. He, C. Chen, S. Chan, J. Bu, and A. Vasilakos (2012), a two-tier trust management architecture was proposed in which a master node computes the trust values for the sensor nodes within its range. In (S. Ganeriwal and M. B. Srivastava, 2004), a watchdog mechanism was used to build the trust rating system. Given a reputation value obtained from the watchdog, the trust metric based on the beta distribution (H. Yu, Z. Shen, C. Miao, C. Leung, D. Niyato, 2010) can be computed by

T_ij = E[R_ij] = (p + 1) / (p + n + 2),   (3.3)

where T_ij refers to node s_i's prediction of the expected future behavior of node s_j, p and n are the numbers of positive and negative outcomes of a specific event, respectively, and R_ij refers to the reputation metric. In particular, p and n are the numbers of successes and failures in forwarding packets between two nodes, respectively. The first-hand or direct reputation value can be determined from the direct observation of node s_j (the observed node) experienced by node s_i. From Figure 3.1, suppose that node s_i prefers to forward the data packet to the destination node by the shortest path via node s_j. In effect, an interaction occurs between node s_i and node s_j. We used a simple binary reputation rating scheme, where a successful outcome (p) is incremented if node s_j forwards the packet and a failed outcome (n) is incremented if node s_j does not forward the packet. Note that p, n ≥ 0, so that the trust value is normalized to the range [0, 1] and the initial value of trust is 0.5. On the other hand, the indirect reputation value can be determined from direct reputation values of node s_j recommended by its

Although aggregated second-hand information (i.e., inquiring from the watchdog the values of $R_{kj}$ held by other nodes $k$ which interacted with node $j$ in the past) helps accelerate the calculation of the reputation value, this chapter considers first-hand observation, or direct reputation, for the sake of simplicity. Furthermore, drawbacks of indirect reputation include vulnerability to bad-mouthing attacks, and the watchdog may not be able to capture all relevant information in the network (H. Yu, Z. Shen, C. Miao, C. Leung, D. Niyato, 2010).

3.4 RL-QRP with Reputation and Trust

In this section, RL-based routing integrated with reputation and trust, called QRT, is described. We redefine the states, actions and rewards as follows:

a) Q-value: let $Q_i(s_j, a)$ denote the opinion of node $i$ about node $j$, which is updated when node $j$ forwards or drops packets passed to it by node $i$:

$$Q_i(s_j, a) \leftarrow (1 - \alpha)\, Q_i(s_j, a) + \alpha \big[ r + T_{ij} + \gamma \max_{a'} Q_j(s_k, a') \big], \qquad (3.4)$$

where the Q-value $Q_i(s_j, a)$ denotes the quality of forwarding packets at node $j$ as experienced by node $i$, and $T_{ij}$ denotes the level of trust in node $j$ experienced by node $i$, which is quantized into intervals of 0.1. The trust value takes values in the range $[0, 1]$.

b) State: $S = \{s_i\}$, $i = 1, 2, \ldots, N$, where $N$ is the number of sensor nodes. Each node is a state in $S$.

c) Trust: $T_{ij}$ is the trust value that quantifies the trustworthiness of node $j$ in forwarding packets from node $i$; we integrated it with the original Q-value of the RL-QRP algorithm by averaging the Q-value and the trust value.

d) Action: $A = \{a(s_i, s_j)\}$, $s_i, s_j \in S$. Execution of $a(s_i, s_j)$ means that the packet is forwarded from state $s_i$ to $s_j$, provided that $s_i$ and $s_j$ are within each other's communication range.

e) Reward function: $r$ is the reward for executing an action at node $i$ (e.g., node $i$ forwards the packet to node $j$), given by

$$r = \frac{D_{i,d} - D_{j,d}}{D_{i,j}} \quad \text{if } d_{i,j} \le T_q. \qquad (3.5)$$

Note that we assumed that every node in the network always sends an ACK back to its upstream node, regardless of its behavior. $D_{i,d}$ and $D_{j,d}$ are the distances from node $i$ and node $j$ to the destination node $d$, respectively, $D_{i,j}$ is the distance between node $i$ and node $j$, $T_q$ is the end-to-end delay requirement encapsulated in the data packet, and $d_{i,j}$ is the experienced delay between node $i$ and node $j$. The pseudo code of the proposed QRT routing algorithm is shown in Table 3.1; a compact C++ sketch of the per-node bookkeeping is given first.
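Before the pseudo code, the sketch below illustrates how the quantities defined in a) to e) could fit together at a single node: the Beta-based trust of (3.3), an epsilon-greedy next-hop choice as in lines 14 to 19 of Table 3.1, and a Q-update in the spirit of (3.4). Because the text describes trust both as an additional reward term in (3.4) and as averaged with the Q-value in item c), the sketch does both, purely as an illustration; all class, member and parameter names are assumptions, and the per-action structure of $Q_i(s_j, a)$ is collapsed into a single value per neighbour for brevity.

#include <algorithm>
#include <cstddef>
#include <random>
#include <unordered_map>
#include <vector>

// Illustrative per-neighbour bookkeeping kept by one node; names are assumed.
struct NeighborStats {
    double q = 0.0;      // Q-value: quality of forwarding via this neighbour
    double alpha = 0.0;  // successful forwarding outcomes (Beta parameter)
    double beta = 0.0;   // failed forwarding outcomes (Beta parameter)

    // Trust from the Beta reputation model, eq. (3.3); starts at 0.5.
    double trust() const { return (alpha + 1.0) / (alpha + beta + 2.0); }

    // Ranking score: average of Q-value and trust, as described in item c).
    double score() const { return 0.5 * (q + trust()); }
};

class QrtNode {
public:
    QrtNode(double learningRate, double discount, double epsilon)
        : alpha_(learningRate), gamma_(discount), eps_(epsilon),
          rng_(std::random_device{}()) {}

    // Epsilon-greedy next-hop choice over the current (non-empty) neighbour
    // set, mirroring lines 14-19 of Table 3.1.
    int selectNextHop(const std::vector<int>& neighbourIds) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        if (u(rng_) < eps_) {
            std::uniform_int_distribution<std::size_t> pick(0, neighbourIds.size() - 1);
            return neighbourIds[pick(rng_)];                     // explore
        }
        return *std::max_element(neighbourIds.begin(), neighbourIds.end(),
            [this](int a, int b) { return stats_[a].score() < stats_[b].score(); });
    }

    // Q-update after forwarding via `next` and receiving reward r, where
    // bestNextQ is the best Q-value reported by the chosen neighbour; trust
    // enters as an additional reward term, following the description of (3.4).
    void updateQ(int next, double r, double bestNextQ) {
        NeighborStats& s = stats_[next];
        s.q = (1.0 - alpha_) * s.q + alpha_ * (r + s.trust() + gamma_ * bestNextQ);
    }

    // Binary reputation rating: count a success when the neighbour is observed
    // forwarding the packet, and a failure when it drops it.
    void updateTrust(int next, bool forwarded) {
        if (forwarded) stats_[next].alpha += 1.0; else stats_[next].beta += 1.0;
    }

private:
    double alpha_, gamma_, eps_;
    std::mt19937 rng_;
    std::unordered_map<int, NeighborStats> stats_;
};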

TABLE 3.1 QRT routing algorithm

01  Begin
02    Initialization
03    Set timer for beacon exchange
04    Begin Loop
05      If timer expires
06        Broadcast beacon to immediate neighboring nodes
07        Re-set timer
08      Endif
09      If beacon packet arrives
10        Update neighboring node's position and Q-value
11      Endif
12      If data packet arrives
13        If good node
14          Generate random number
15          If random number > ε
16            Select neighboring node with highest Q-value
17          Else
18            Randomly select neighboring node
19          Endif
20          Receive reward r
21          Update the Q-value
22          Update trust
23        Else
24          Drop packet
25        Endif
26      Endif
27    Go to 04
28  End

3.5 Performance Evaluation

In this section, we evaluated the proposed QRT routing algorithm, which integrated the existing RL-QRP (L. Xuedong, I. Balasingham, and S.S. Byun, 2008) with the reputation and trust scheme. Results were compared with the original RL-QRP and a non-learning threshold reputation scheme. The latter scheme ranked the trust values of the neighboring nodes and selected the next node with the highest trust value above a predetermined threshold of 0.4, which was found to give the best performance among the threshold values tested. Visual C++ was used to simulate an mwsn under various conditions according to Table 3.2 and Table 3.3.

A number of nodes in the mwsn were mobile and followed the random waypoint mobility model, which is suitable for modeling a user's mobility in a confined area or within a hospital. The velocity was randomly chosen from [0, 5] m/s. The remaining nodes were assumed static. These parameters are suitable for biomedical applications, where each node represents a patient to whom a health-monitoring sensor node is attached. Each experiment was repeatedly run with different seeds, each with a runlength of 10^6 events, until the sample-averaged results were within a 10% range.

Unconstrained Traffic Demand

Initially, we evaluated the routing performance of the algorithms when there was no constraint on the QoS of the traffic demand. This experiment was divided into two parts, in which we considered the cases when the node mobility was varied and when the number of malicious nodes present in the network was varied.

Part 1: Malicious Nodes Effect

In this experiment, there were 9 mobile nodes out of 36 nodes. To study the effect of malicious nodes and the degree to which they misbehave, the number of malicious nodes was varied from 9 to 18 and their packet dropping probability was varied from 0 to 1. The following metrics were measured:

TABLE 3.2 Simulation Parameters

Parameters                                 Part 1                    Part 2
Number of sensor nodes                     36                        36
Node mobility                              Random waypoint           Random waypoint
Pause time (s)                             60                        60
Node velocity (m/s)                        Min. 0, Max. 5            Min. 0, Max. 5
Area size                                  200 x 200 m^2             200 x 200 m^2
Transmission range                         50 m                      50 m
Runlength (number of route requests)       10^6                      10^6
Learning rate (α) for RL-QRP, QRT          0.5                       0.5
Discount factor (γ) for RL-QRP, QRT        0.5                       0.5
Number of mobile nodes                     9                         0, 9, 18, 27, 36
Number of malicious nodes                  9, 18                     9
Probability of dropping a packet           0, 0.25, 0.5, 0.75, 1     fixed (see Part 2)

Average success ratio (%) is given by

$$\text{Average success ratio (\%)} = \frac{\text{number of successfully discovered paths}}{\text{total number of route requests}} \times 100. \qquad (3.6)$$

This metric is the proportion of successfully discovered paths. Figure 3.2 illustrates the average success ratio for the QRT, RL-QRP and threshold schemes as the packet dropping probability was varied. Note that for all packet dropping probabilities, the average success ratio of QRT was up to 11% greater than RL-QRP and up to 25% greater than the threshold scheme. Such results indicated that QRT can identify and avoid malicious nodes more effectively than the RL-QRP and threshold schemes and thereby discover more paths that can reach the destination node.

Average end-to-end delay: In Figure 3.3, the average end-to-end delay is shown against the packet dropping probability. Note that QRT showed a higher average end-to-end delay than RL-QRP. This was because QRT discovered more paths than the other schemes, as shown in the previous figure. As Figure 3.4 shows, such paths included both short paths (2 or 3 hops), in numbers comparable to RL-QRP, and long paths (4 hops and up), which were discovered in significantly greater numbers than with RL-QRP. The threshold scheme discovered the fewest short paths of all, thus obtaining the highest average end-to-end delay.

Figure 3.2 Average success ratio of discovered paths

Figure 3.3 Average end-to-end delay of discovered paths

Part 2: Mobility Effect

In this part, the algorithms' performance when varying node mobility was investigated. For this scenario, 9 malicious nodes were present, each with the same packet dropping probability. This setting was used because a high success ratio was observed for all schemes, so the effect of increased mobility would be more visible. The degree of mobility was varied by increasing the number of moving nodes from 0 (least mobile) to 36 (most mobile).
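As a reference for the two behaviours varied in these experiments, the following is a minimal C++ sketch of a random-waypoint mobile node and of the stochastic packet dropping of a malicious node, under the parameter ranges of Table 3.2 (square area, velocity drawn uniformly from [0, vMax] m/s). The class and function names are illustrative, the pause-time handling is omitted for brevity, and the update step dt is an assumption.

#include <cmath>
#include <random>

// Illustrative sketch only; names and the update granularity are assumptions.
struct Waypoint { double x, y; };

class MobileNode {
public:
    MobileNode(double areaSize, double vMax, unsigned seed)
        : rng_(seed), uniPos_(0.0, areaSize), uniVel_(0.0, vMax) {
        pos_ = {uniPos_(rng_), uniPos_(rng_)};
        pickNewWaypoint();
    }

    // Advance the node by dt seconds under the random waypoint model: move
    // toward the current waypoint at the chosen speed; once it is reached,
    // draw a new waypoint and a new speed (pause time omitted here).
    void step(double dt) {
        const double dx = target_.x - pos_.x, dy = target_.y - pos_.y;
        const double dist = std::sqrt(dx * dx + dy * dy);
        const double move = speed_ * dt;
        if (move >= dist) { pos_ = target_; pickNewWaypoint(); return; }
        pos_.x += move * dx / dist;
        pos_.y += move * dy / dist;
    }

    Waypoint position() const { return pos_; }

private:
    void pickNewWaypoint() {
        target_ = {uniPos_(rng_), uniPos_(rng_)};
        speed_  = uniVel_(rng_);   // speed drawn uniformly from [0, vMax] m/s
    }
    double speed_ = 0.0;
    Waypoint pos_{}, target_{};
    std::mt19937 rng_;
    std::uniform_real_distribution<double> uniPos_, uniVel_;
};

// A malicious node drops a received data packet with the configured
// packet dropping probability (varied in Part 1 of the experiments).
bool maliciousDrops(std::mt19937& rng, double dropProbability) {
    std::bernoulli_distribution drop(dropProbability);
    return drop(rng);
}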

Figure 3.4 Number of discovered paths for each path length for 9 malicious nodes

Average success ratio (%): Figure 3.5 illustrates the average success ratio for all schemes. Note that QRT consistently outperformed both the RL-QRP and threshold schemes, by up to 9% and 22%, respectively. However, the margin between QRT and RL-QRP decreased as mobility increased.

Average end-to-end delay: In Figure 3.6, the average end-to-end delay is shown versus the number of moving nodes. Similar to Figure 3.3, the average end-to-end delay of QRT was greater than that of RL-QRP but less than that of the threshold scheme. This was because, as shown in Figure 3.7, QRT found more long paths (4 hops and up) than RL-QRP and the threshold scheme, while obtaining a number of short paths (2 or 3 hops) comparable to RL-QRP. Furthermore, although the number of discovered paths gradually decreased as mobility increased, QRT consistently discovered more paths than the other schemes.

Figure 3.5 Average success ratio under various degrees of mobility

Figure 3.6 Average end-to-end delay of discovered paths under various degrees of mobility

Figure 3.7 Number of discovered paths for each path length under various degrees of mobility

Traffic Demand with End-to-End Delay QoS

In this experiment, there were 9 mobile nodes and 9 malicious nodes present in the 36-node mwsn. To study the impact of the QoS requirement on the network, the end-to-end delay requirement ($T_q$) was varied over 50, 100, 200 and 300 msec. The remaining simulation parameters are shown in Table 3.3.

TABLE 3.3 Simulation Parameters

Parameters                                 Value
Number of sensor nodes                     36
Node mobility                              Random waypoint
Pause time (s)                             60
Node velocity (m/s)                        Min. 0, Max. 5
Area size                                  200 x 200 m^2
Transmission range                         50 m
Runlength (number of route requests)       10^6
Learning rate (α) for RL-QRP, QRT          0.5
Discount factor (γ) for RL-QRP, QRT        0.5
Number of mobile nodes                     9
Number of malicious nodes                  9
Probability of dropping a packet           0, 0.5
End-to-end delay requirement (msec)        50, 100, 200, 300

Average success ratio: In Figures 3.8 and 3.9, the average success ratio is shown against the end-to-end delay requirement ($T_q$). In this experiment, we modified the proposed QRT and the existing RL-QRP to handle different stringent end-to-end delay requirements. In particular, the reward function (3.5) was modified by varying $T_q$ accordingly for both algorithms. We thus refer to them as QRT_Tq_reward and RL-QRP_Tq_reward, respectively.

Furthermore, we also evaluated a more aggressive approach to finding paths that meet the end-to-end delay requirement, by allowing the agents in both algorithms to search for next hops only on paths whose estimated delay so far does not exceed the end-to-end delay requirement. This modification discovers paths which strictly satisfy the QoS requirement, so we refer to the resulting schemes as QRT_strict and RL-QRP_strict, respectively. The value of $T_q$ was varied in the range 50-300 msec. We considered the cases when the packet dropping probability was 0 and 0.5. From Figures 3.8 and 3.9, we can see that QRT consistently outperformed RL-QRP. In addition, the average success ratios of QRT_Tq_reward and RL-QRP_Tq_reward are greater than those of QRT_strict and RL-QRP_strict. The reason is that QRT_Tq_reward and RL-QRP_Tq_reward cannot screen out the paths whose path delay exceeds the end-to-end delay requirement, as shown in Figures 3.12 to 3.15. Furthermore, the average success ratio of QRT_strict and RL-QRP_strict decreased as $T_q$ became more stringent, because these two methods conservatively filter out paths whose delay exceeds $T_q$.
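A minimal C++ sketch of the strict variant's admission rule is given below; the structure and names are assumptions, but the test mirrors the rule just described: a neighbour is only considered if the delay accumulated so far plus its estimated hop delay still fits within $T_q$.

#include <vector>

// Illustrative sketch of the "strict" QoS next-hop filter; names are assumed.
struct Candidate {
    int    nodeId;
    double score;          // combined Q-value / trust score of the neighbour
    double estHopDelayMs;  // estimated one-hop delay to this neighbour
};

// Returns the admissible candidate with the highest score, or -1 when no
// neighbour can still meet the delay requirement (the route request fails).
int selectNextHopStrict(const std::vector<Candidate>& neighbours,
                        double delaySoFarMs, double tqMs) {
    int best = -1;
    double bestScore = 0.0;
    for (const Candidate& c : neighbours) {
        if (delaySoFarMs + c.estHopDelayMs > tqMs) continue;  // filter late paths
        if (best == -1 || c.score > bestScore) {
            best = c.nodeId;
            bestScore = c.score;
        }
    }
    return best;
}

Because a route request can fail outright when no admissible neighbour remains, the strict variants trade success ratio for guaranteed delay, which is the behaviour observed in Figures 3.8 to 3.11.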

Figure 3.8 Average success ratio under different end-to-end delay requirements and a malicious-node packet dropping probability of 0

Figure 3.9 Average success ratio under different end-to-end delay requirements and a malicious-node packet dropping probability of 0.5

Average end-to-end delay: In Figures 3.10 and 3.11, the average end-to-end delay is shown against the end-to-end delay requirement when the packet dropping probability is 0 and 0.5, respectively. Note that the average end-to-end delays of QRT and RL-QRP are similar. The average end-to-end delays of QRT_strict and RL-QRP_strict strictly satisfy $T_q$, because these schemes select only paths whose delays do not exceed $T_q$. However, QRT_Tq_reward and RL-QRP_Tq_reward cannot screen out such path delays by means of reward modification alone.

Figure 3.10 Average end-to-end delay under different end-to-end delay requirements and a malicious-node packet dropping probability of 0

Figure 3.11 Average end-to-end delay under different end-to-end delay requirements and a malicious-node packet dropping probability of 0.5

Figure 3.12 Number of discovered paths under different end-to-end delays (packet dropping probability 0, $T_q$ = 100 msec)

Figure 3.13 Number of discovered paths under different end-to-end delays (packet dropping probability 0.5, $T_q$ = 100 msec)

Figure 3.14 Number of discovered paths under different end-to-end delays (packet dropping probability 0, $T_q$ = 200 msec)

Figure 3.15 Number of discovered paths under different end-to-end delays (packet dropping probability 0.5, $T_q$ = 200 msec)

3.6 Conclusion

We proposed the QRT routing algorithm for non-cooperative mwsns comprising malicious, stochastically packet-dropping nodes. QRT was based on an RL routing method which incorporated a reputation and trust mechanism to screen out malicious nodes. The mechanism employed direct reputation of observed nodes to evaluate their trust values. We compared QRT against the RL-QRP and threshold schemes. Results showed that the average success ratio of QRT was up to 11% and 25% greater than RL-QRP and the heuristic non-learning threshold scheme, respectively. As the mobility of the network increased, QRT consistently outperformed the other algorithms, gaining up to 9% and 22% in success ratio over the RL-QRP and threshold schemes. The results suggest that a reputation and trust mechanism can be applied to identify and avoid malicious packet-dropping nodes in mwsns.

In terms of quality-of-service, the results have shown that QRT consistently outperformed RL-QRP even in the presence of a high packet dropping probability and stringent end-to-end delay requirements. The results suggest that the QRT scheme, with its reputation and trust mechanism, can be applied to cater for quality-of-service in mwsns.

CHAPTER IV
CONCLUSION AND FUTURE WORK

4.1 Conclusion

In this thesis, we proposed a routing method called the QRT algorithm for non-cooperative mwsns based on Reinforcement Learning (RL). In particular, QRT was the integration of a reputation and trust scheme, used to avoid misbehaving nodes, with an existing RL-based routing protocol called RL-QRP. We evaluated its performance in non-cooperative mwsns under various non-cooperation, mobility and end-to-end delay conditions. The experimental work carried out in this thesis was divided into two parts, covering unconstrained and delay-constrained traffic demands. In the first experiment, we varied the number of malicious nodes and the number of mobile nodes to study their impact, and compared the results with the original RL-QRP algorithm and a non-learning threshold scheme in terms of average success ratio (%), average end-to-end delay and the number of discovered paths for each path length. In the subsequent experiment, we extended the framework to include delay-constrained quality-of-service in the simulation. We considered two types of modification, QRT_strict and QRT_Tq_reward, and compared the results with the same modifications of RL-QRP in terms of average success ratio (%), average end-to-end delay and the number of discovered paths under different end-to-end delay requirements. These two parts were presented in Chapter 3. The original contributions and findings of this thesis can be summarized as follows.

QRT

The first contribution was the proposed QRT scheme, which showed that the Q-learning algorithm can be applied to support routing in mwsns which include misbehaving nodes. We extended the state space, which originally consisted of only the neighboring nodes of an agent, to include quantized trust levels of those neighbors as well. We also modified the Q-value updating equation (3.4) by adding the trust value as an additional reward term, in order to take account of the trust level between nodes. Performance comparisons were made with the existing RL-QRP algorithm and the threshold scheme. The simulation in the first part varied the number of malicious nodes along with the packet dropping probability of a malicious node. In the second part, the simulation varied the number of mobile nodes. The experimental results showed that the QRT method consistently outperformed RL-QRP and the threshold scheme in terms of success ratio when varying the number of malicious nodes, achieving up to 11% and 25% more than the two schemes, respectively. The QRT method also discovered more long paths than the other schemes. When the number of mobile nodes increased, QRT gained up to 9% and 22% in success ratio over the RL-QRP algorithm and the threshold scheme, respectively.

Quality-of-Service

The purpose of this section was to add quality-of-service, in terms of an end-to-end delay requirement, to the simulation. In the first part of this study, we modified the end-to-end delay requirement $T_q$ in the Q-learning reward. The results showed that varying $T_q$ alone cannot screen out paths whose end-to-end delay exceeds $T_q$.

An alternative approach was then trialed which selected next-hop nodes whose path delay so far had not yet exceeded $T_q$. The results suggested that QRT performed well in scenarios where end-to-end delay quality-of-service was required by the traffic demands, even in the presence of malicious nodes, achieving a success ratio up to 11% higher than RL-QRP.

The significance of our work lies in proposing means to enhance routing in the presence of misbehaving nodes in mwsns. We studied the effects of mobility and different degrees of malicious node behavior. Moreover, we added quality-of-service to the experiments for a more realistic biomedical application scenario using mwsns. We can conclude that the QRT approach obtains better routing performance than RL-QRP and the threshold scheme by detecting and avoiding malicious nodes in mwsns under various conditions of packet dropping probability, node mobility and stringent end-to-end delay requirements.

4.2 Future Work

mwsns with Indirect Reputation Value

To study the effect of the indirect reputation value, which is the opinion about the next node reported by other neighboring nodes (for example, when node i considers forwarding a packet to node j, node i obtains node k's opinion about node j to evaluate the trustworthiness of node j), note that Srinivasan and Teitelbaum (2006) proposed the distributed reputation-based beacon trust system (DRBTS), which used both direct and indirect reputation based on the Beta distribution to weight the reward for a node's decision in choosing the next node. A possible direction for future extension of this thesis is therefore to include indirect reputation in the framework.
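As a rough illustration of what such an extension might look like, the sketch below folds second-hand Beta parameters into a node's direct reputation of a neighbour, discounting them by a fixed weight so that recommendations count less than first-hand observation. The weight value, the struct and the function names are assumptions made for this illustration; they are not taken from DRBTS or from this thesis.

#include <vector>

// Illustrative only: one simple way to combine direct and indirect reputation.
struct BetaReputation {
    double alpha = 0.0;  // positive outcomes observed
    double beta  = 0.0;  // negative outcomes observed
    double trust() const { return (alpha + 1.0) / (alpha + beta + 2.0); }
};

// Node i's direct reputation of j, combined with recommendations about j
// reported by neighbouring nodes k, each discounted by indirectWeight (< 1).
BetaReputation combineReputation(const BetaReputation& direct,
                                 const std::vector<BetaReputation>& recommendations,
                                 double indirectWeight = 0.5) {
    BetaReputation combined = direct;
    for (const BetaReputation& rec : recommendations) {
        combined.alpha += indirectWeight * rec.alpha;
        combined.beta  += indirectWeight * rec.beta;
    }
    return combined;
}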

Traffic Priority

In biomedical mobile wireless sensor networks there is a great variety of health information, and the significance of each type of information differs. Priority could be given to important information, such as heart rate, over delay-tolerant traffic, such as temperature, by reserving short routes for important information only, in order to avoid packet collisions and excessive buffering. Hence, traffic and route prioritization are promising directions for further study.

Performance Evaluation on a Test Bed

The main objective of this thesis was to improve routing performance in mwsns by using RL with trust and reputation. The experiments were simulated in a Visual C++ environment to perform the learning process and evaluate the algorithms. Therefore, an important future direction is to extend the work towards real data collection for training the learning algorithm in actual mwsns.

mwsns with Energy Consumption Condition

Energy consumption in mwsns is one of the most important issues. Dealing with the energy problem in mwsns by expanding the state space with the remaining battery level of each node, and making energy-aware routing decisions at intermediate nodes along the route, warrants further investigation.

REFERENCES

Aghaei, R., Rahman, A., Gueaieb, W., Saddik, A. (2007). Ant Colony-Based Reinforcement Learning Algorithm for Routing in Wireless Sensor Networks. Proceedings of the Instrumentation and Measurement Technology Conference.

Blaze, M., Feigenbaum, J., Lacy, J. (1996). Decentralized Trust Management. Proceedings of Security and Privacy.

Buchegger, S., Le Boudec, J.-Y. (2002). Performance Analysis of the CONFIDANT Protocol (Cooperation of Nodes: Fairness in Dynamic Ad-hoc NeTworks). Proceedings of the Third ACM International Symposium on Mobile Ad Hoc Networking and Computing.

Buchegger, S., Le Boudec, J.-Y. (2003a). Coping with False Accusations in Misbehavior Reputation Systems for Mobile Ad-Hoc Networks. Technical Report IC/2003/31, EPFL-DI-ICA.

Buchegger, S., Le Boudec, J.-Y. (2003b). The Effect of Rumor Spreading in Reputation Systems for Mobile Ad-hoc Networks. Proceedings of Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks.

Chen, H., Wu, H., Zhou, X., Gao, C. (2007). Reputation-based Trust in Wireless Sensor Networks. Proceedings of Multimedia and Ubiquitous Engineering.

Dong, S., Agrawal, P., Sivalingam, K. (2007). Reinforcement Learning Based Geographic Routing Protocol for UWB Wireless Sensor Network. Proceedings of the Global Telecommunications Conference.

Forster, A., Murphy, A.L. (2007). Exploiting Reinforcement Learning for Multiple Sink Routing in WSNs. Proceedings of the National Competence Center in Research on Mobile Information and Communication Systems.

Forster, A., Murphy, A.L., Schiller, J., Terfloth, K. (2008). An Efficient Implementation of Reinforcement Learning Based Routing on Real WSN Hardware. Proceedings of the International Conference on Wireless and Mobile Computing.

Ganeriwal, S., Srivastava, M. B. (2004). Reputation based Framework for High Integrity Sensor Networks. Proceedings of Security of Ad Hoc and Sensor Networks.

Gelenbe, E., Gellman, M. (2007). Oscillations in a Bio-Inspired Routing Algorithm. Proceedings of Mobile Ad Hoc and Sensor Systems.

He, D., Chen, C., Chan, S., Bu, J., Vasilakos, A. (2012). ReTrust: Attack-resistant and Lightweight Trust Management for Medical Sensor Networks. Journal of Information Technology in Biomedicine, vol. 16, no. 4, pp.

Istepanian, R.S.H., Jovanov, E., Zhang, Y.T. (2004). Guest Editorial Introduction to the Special Section on M-Health: Beyond Seamless Mobility and Global Wireless Health-Care Connectivity. Journal of Information Technology in Biomedicine, vol. 8, no. 4, pp.

Josang, A., Knapskog, S.J. (1998). A Metric for Trust Systems. Proceedings of the 21st National Information Systems Security Conference.

Jovanov, E., Poon, C., Guang-Zhong, Y., Zhang, Y.T. (2009). Guest Editorial Body Sensor Networks: From Theory to Emerging Applications. Journal of Information Technology in Biomedicine, vol. 13, no. 6, pp.

Kaelbling, L.P., Littman, M.L., Moore, A.P. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, vol. 4, pp.

Karaki, J.N., Kamal, A.E. (2004). Routing Techniques in Wireless Sensor Networks: a Survey. Journal of Wireless Communications, vol. 11, no. 4, pp.

Kim, K., Kim, H., Hong, Y. (2009). A Self Localization Scheme for Mobile Wireless Sensor Networks. Proceedings of Computer Sciences and Convergence Information Technology.

Kim, K., Lee, I.S., Yoon, M., Kim, J., Lee, H., Han, K. (2009). An Efficient Routing Protocol Based on Position Information in Mobile Wireless Body Area Sensor Networks. Proceedings of Networks and Communications.

Lan Tien Nguyen, Defago, X., Beuran, R., Shinoda, Y. (2008). An Energy Efficient Routing Scheme for Mobile Wireless Sensor Networks. Proceedings of Wireless Communication Systems.

Michiardi, P., Molva, R. (2002). Core: A Collaborative Reputation Mechanism to Enforce Node Cooperation in Mobile Ad hoc Networks. Proceedings of Communications and Multimedia Security.

Pang, Z., Chen, Q., Zheng, L. (2009). A Pervasive and Preventive Healthcare Solution for Medication Noncompliance and Daily Monitoring. Proceedings of Applied Sciences in Biomedical and Communication Technologies.

Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.

Renaud, J.C., Tham, C.K. (2006). Coordinated Sensing Coverage in Sensor Networks using Distributed Reinforcement Learning. Proceedings of the International Conference on Networks.

Resnick, P., Zeckhauser, R. (2000). Trust Among Strangers in Internet Transactions: Empirical Analysis of eBay's Reputation System. The Economics of the Internet and E-commerce: Advances in Applied Microeconomics, vol. 11, pp.

Resnick, P., Kuwabara, K., Zeckhauser, R., Friedman, E. (2000). Reputation Systems. Communications of the ACM, vol. 43, no. 12, pp.

Seah, M.W.M., Tham, C.K., Srinivasan, V., Xin, A. (2007). Achieving Coverage through Distributed Reinforcement Learning in Wireless Sensor Networks. Proceedings of Intelligent Sensors, Sensor Networks and Information.

Sutton, R., Barto, A. (1998). Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press.

Tanachaiwiwat, S., Dave, P., Bhindwale, R., Helmy, A. (2003). Location-centric Isolation of Misbehavior and Trust Routing in Energy-Constrained Sensor Networks. Proceedings of Performance, Computing and Communications.

Varshney, U. (2008). Improving Wireless Health Monitoring Using Incentive-Based Router Cooperation. IEEE Computer Magazine, vol. 41, no. 5, pp.

Wang, P., Wang, T. (2006). Adaptive Routing for Sensor Networks using Reinforcement Learning. Proceedings of Computer and Information Technology.

Watkins, C. (1989). Learning from Delayed Rewards. PhD thesis, University of Cambridge, England.

Xuedong, L., Balasingham, I., Byun, S.S. (2008). A Multi-agent Reinforcement Learning Based Routing Protocol for Wireless Sensor Networks. Proceedings of Wireless Communication Systems.

Xuedong, L., Balasingham, I., Byun, S.S. (2008). A Reinforcement Learning Based Routing Protocol with QoS Support for Biomedical Sensor Networks. Proceedings of Applied Sciences on Biomedical and Communication Technology.

Yadav, V., Mishra, M.K., Gore, M.M. (2009). Localization Scheme for Three Dimensional Wireless Sensor Networks Using GPS Enabled Mobile Sensor Nodes. Journal of Next-Generation Networks, vol. 1, no. 1, pp.

Ying-Hong, W., Chin-Yung, Y., Wei-Ting, C., Chun-Xuan, W. (2008). An Average Energy Based Routing Protocol for Mobile Sink in Wireless Sensor Networks. Proceedings of Ubi-Media Computing.

Yu, H., Shen, Z., Miao, C., Leung, C., Niyato, D. (2010). A Survey of Trust and Reputation Management Systems in Wireless Communications. Proceedings of the IEEE, vol. 98, no. 10, pp.

Zhou, Y., Xing, J., Yu, Q. (2006). Overview of Power-efficient MAC and Routing Protocols for Wireless Sensor Networks. Proceedings of Mechatronic and Embedded Systems and Applications.

APPENDIX

PUBLICATION

Publication

Naputta, Y., and Usaha, W. (2012). RL-based Routing in Biomedical Mobile Wireless Sensor Networks using Trust and Reputation. The 9th International Symposium on Wireless Communication Systems (ISWCS), France, August 2012.



More information

A Self-Learning Repeated Game Framework for Optimizing Packet Forwarding Networks

A Self-Learning Repeated Game Framework for Optimizing Packet Forwarding Networks A Self-Learning Repeated Game Framework for Optimizing Packet Forwarding Networks Zhu Han, Charles Pandana, and K.J. Ray Liu Department of Electrical and Computer Engineering, University of Maryland, College

More information

ว.ว ทย. มข. 45(2) (2560) KKU Sci. J. 45(2) (2017) บทค ดย อ ABSTRACT

ว.ว ทย. มข. 45(2) (2560) KKU Sci. J. 45(2) (2017) บทค ดย อ ABSTRACT ว.ว ทย. มข. 45(2) 418-437 (2560) KKU Sci. J. 45(2) 418-437 (2017) การปร บปร งรห สล บฮ ลล โดยอาศ ยการเข ารห สล บเป นคาบสองช น และการแปรผ นความยาว A Modification of the Hill Cipher Based on Doubly Periodic

More information

Analysis of Cluster-Based Energy-Dynamic Routing Protocols in WSN

Analysis of Cluster-Based Energy-Dynamic Routing Protocols in WSN Analysis of Cluster-Based Energy-Dynamic Routing Protocols in WSN Mr. V. Narsing Rao 1, Dr.K.Bhargavi 2 1,2 Asst. Professor in CSE Dept., Sphoorthy Engineering College, Hyderabad Abstract- Wireless Sensor

More information

Featuring Trust and Reputation Management Systems for Constrained Hardware Devices*

Featuring Trust and Reputation Management Systems for Constrained Hardware Devices* Featuring Trust and Reputation Management Systems for Constrained Hardware Devices* Rodrigo Román, M. Carmen Fernández-Gago, Javier López University of Málaga, Spain *(Wireless Sensor Networks) Contents

More information

Security Enhancements for Mobile Ad Hoc Networks with Trust Management Using Uncertain Reasoning

Security Enhancements for Mobile Ad Hoc Networks with Trust Management Using Uncertain Reasoning Security Enhancements for Mobile Ad Hoc Networks with Trust Management Using Uncertain Reasoning Sapna B Kulkarni,B.E,MTech (PhD) Associate Prof, Dept of CSE RYM Engg.college, Bellari VTU Belgaum Shainaj.B

More information

Planning and Control: Markov Decision Processes

Planning and Control: Markov Decision Processes CSE-571 AI-based Mobile Robotics Planning and Control: Markov Decision Processes Planning Static vs. Dynamic Predictable vs. Unpredictable Fully vs. Partially Observable Perfect vs. Noisy Environment What

More information

Rab Nawaz Jadoon DCS. Assistant Professor. Department of Computer Science. COMSATS Institute of Information Technology. Mobile Communication

Rab Nawaz Jadoon DCS. Assistant Professor. Department of Computer Science. COMSATS Institute of Information Technology. Mobile Communication Rab Nawaz Jadoon DCS Assistant Professor COMSATS IIT, Abbottabad Pakistan COMSATS Institute of Information Technology Mobile Communication WSN Wireless sensor networks consist of large number of sensor

More information

MDR Based Cooperative Strategy Adaptation in Wireless Communication

MDR Based Cooperative Strategy Adaptation in Wireless Communication MDR Based Cooperative Strategy Adaptation in Wireless Communication Aswathy Mohan 1, Smitha C Thomas 2 M.G University, Mount Zion College of Engineering, Pathanamthitta, India Abstract: Cooperation among

More information

Efficient Power Management in Wireless Communication

Efficient Power Management in Wireless Communication Efficient Power Management in Wireless Communication R.Saranya 1, Mrs.J.Meena 2 M.E student,, Department of ECE, P.S.R.College of Engineering, sivakasi, Tamilnadu, India 1 Assistant professor, Department

More information

Mobile Agent Driven Time Synchronized Energy Efficient WSN

Mobile Agent Driven Time Synchronized Energy Efficient WSN Mobile Agent Driven Time Synchronized Energy Efficient WSN Sharanu 1, Padmapriya Patil 2 1 M.Tech, Department of Electronics and Communication Engineering, Poojya Doddappa Appa College of Engineering,

More information

Ameliorate Threshold Distributed Energy Efficient Clustering Algorithm for Heterogeneous Wireless Sensor Networks

Ameliorate Threshold Distributed Energy Efficient Clustering Algorithm for Heterogeneous Wireless Sensor Networks Vol. 5, No. 5, 214 Ameliorate Threshold Distributed Energy Efficient Clustering Algorithm for Heterogeneous Wireless Sensor Networks MOSTAFA BAGHOURI SAAD CHAKKOR ABDERRAHMANE HAJRAOUI Abstract Ameliorating

More information